Existing text-guided video editing methods often suffer from
temporal inconsistency, motion distortion, and errors under
cross-domain transformation. We attribute these limitations to
insufficient modeling of spatio-temporal pixel relevance during
the editing process.
To address this, we propose STR-Match, a training-free video
editing technique that produces visually appealing and
temporally coherent videos through latent optimization guided by
our novel STR score. The proposed score captures spatio-temporal
pixel relevance across adjacent frames by leveraging 2D spatial
attention and 1D temporal attention maps in text-to-video~(T2V)
diffusion models, without the overhead of computationally
expensive full 3D attention.
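The factorization above can be sketched as follows. The abstract does not give the exact STR score formula, so the composition rule, the function name `factored_relevance`, and all shapes are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factored_relevance(feats):
    """Illustrative spatio-temporal relevance between adjacent frames.

    feats: (T, N, C) array -- T frames, N = H*W spatial tokens, C channels.
    Composes per-frame 2D spatial attention (T x N x N) with per-location
    1D temporal attention (N x T x T) instead of one full 3D attention over
    all T*N tokens, reducing cost from O((T*N)^2 * C) to
    O(T*N^2*C + N*T^2*C).

    Returns rel: (T-1, N, N), where rel[t, n, m] scores the relevance of
    token n in frame t to token m in frame t+1. (Assumed composition; the
    paper's STR score may combine the maps differently.)
    """
    T, N, C = feats.shape
    scale = 1.0 / np.sqrt(C)
    # 2D spatial attention within each frame: (T, N, N)
    spatial = softmax(np.einsum('tnc,tmc->tnm', feats, feats) * scale)
    # 1D temporal attention at each spatial location: (N, T, T)
    temporal = softmax(np.einsum('tnc,snc->nts', feats, feats) * scale)

    rel = np.empty((T - 1, N, N))
    for t in range(T - 1):
        tw = temporal[:, t, t + 1]  # (N,) weight of the t -> t+1 hop per location
        # spatial hop in frame t, temporal hop at each location, spatial hop in t+1
        rel[t] = (spatial[t] * tw[None, :]) @ spatial[t + 1]
    return rel
```

Because each factor is row-stochastic, every entry of `rel` stays in [0, 1], so the composed map can be read as a soft correspondence between adjacent frames.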
Integrated into a latent optimization framework with a latent
mask, STR-Match generates high-fidelity videos with strong
spatio-temporal consistency, preserving key visual attributes of
the source video while remaining robust under significant domain
shifts. Our extensive experiments demonstrate that STR-Match
consistently outperforms existing methods in both visual quality
and spatio-temporal consistency.
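As a minimal illustration of the masked latent-optimization pattern, the toy loop below uses a quadratic objective as a stand-in for the actual STR-guided score; the function name, shapes, and loss are assumptions for exposition, not the paper's method:

```python
import numpy as np

def masked_latent_optimization(z_edit, z_src, mask, steps=100, lr=0.1):
    """Toy sketch: gradient descent on an edited latent so that regions
    selected by a binary latent mask stay faithful to the source latent,
    while unmasked regions remain free for the edit.

    Stand-in loss: L(z) = || mask * (z - z_src) ||^2
    (In STR-Match the guidance signal is the STR score, not this loss.)
    """
    z = z_edit.copy()
    for _ in range(steps):
        grad = 2.0 * mask * (z - z_src)  # dL/dz; zero outside the mask
        z -= lr * grad
    return z
```

With this quadratic loss the masked region converges geometrically toward the source latent while the unmasked region is untouched, mirroring how a latent mask confines source preservation to selected spatio-temporal locations.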